In this study, I make use of the following resources:

https://www.kaggle.com/new-york-city/nyc-property-sales

https://www1.nyc.gov/assets/finance/downloads/pdf/07pdf/glossary_rsf071607.pdf and

https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html

Initialization

Packages

Functions

Operational functions

Calculation functions

Graph functions

Outlier functions

Model functions

Initial Look-Up

Perform exploratory analysis on this dataset and produce a showcase/storyline of a few interesting patterns and your observations. You will walk us through your findings during our interview.

The columns are as follows:

Dates covered: between 2016-09-01 and 2017-08-31 (12 months)

Unique Key

It seems that Borough-Block-Lot (BBL) is not a key in this data set (it is not unique for each record). This is fine, since this is transaction data and a unit can be sold many times. For instance, bbl = '1-373-40' appears at V1 = 16 and V1 = 17, and both rows share the same address.

On the other hand, there can be other units in the same bbl. For instance, bbl = '1-373-40' also appears at V1 = 17, V1 = 18, and V1 = 19, and all of these are different apartments.

To understand whether two units in a bbl are the same or different, we can benefit from the address column when it is complete.

However, V1 and Borough together provide a unique key for the data set.
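As a quick sketch of these uniqueness checks in data.table (toy rows mirroring the example above; the addresses are hypothetical, and column names like bbl and V1 follow this notebook's naming):

```r
library(data.table)

# Toy rows mirroring the example above: one lot (bbl), several sales/units.
dt <- data.table(
  V1      = c(16, 17, 18, 19),
  borough = c(1, 1, 1, 1),
  bbl     = rep("1-373-40", 4),
  address = c("X St 4A", "X St 4A", "X St 5B", "X St 6C")  # hypothetical addresses
)

# bbl alone is not unique: a unit can be sold many times,
# and one lot can contain several units.
bbl_unique <- uniqueN(dt, by = "bbl") == nrow(dt)               # FALSE

# V1 together with borough identifies each record.
key_unique <- uniqueN(dt, by = c("V1", "borough")) == nrow(dt)  # TRUE
```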

So, I'll proceed with transforming the column types to analyze the data further.

Transforming the columns

Now we can describe the data better.

Preliminary Analysis

Since there are saleprice values that are 0 or NA and I want to eliminate them, I'll first try to understand the structure of the buildings. So, I'll explore the data without any focus on saleprice.

Correlations

Unit Relationships

Tax Class Relationships

We can keep either taxclass_present or taxclass_past, as they are strongly correlated.

There is no correlation between commercial units and tax class.

The negative correlation between taxclass and borough means that Manhattan (borough = 1), for instance, mostly contains tax class 4 buildings.
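A minimal sketch of the kind of check behind the tax-class statement above (toy vectors, not the real columns — here the two tax-class columns disagree in only one position, so their correlation comes out high):

```r
# Toy stand-ins: taxclass_past mostly agrees with taxclass_present.
taxclass_present <- c(1, 1, 2, 2, 4, 4, 4, 1)
taxclass_past    <- c(1, 1, 2, 2, 4, 4, 2, 1)

# Strong positive correlation -> keeping one of the two columns suffices.
r <- cor(taxclass_present, taxclass_past)  # close to 1
```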

Square Feet Columns

All Data: Correlation

Let's see what happens if there were no missing info in the data set.

Let's meet the data

Clean the data

Now, we'll focus on saleprice. Some records are transactions between family members; therefore, the saleprice for such transactions is 0 or very small.

Since such records cannot be used to understand the value of a construction/building, I eliminate NA values, zeros, and very small values.
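A minimal sketch of this cleaning step in data.table; the 10,000 threshold for "very small" is an illustrative assumption, not the notebook's exact cutoff:

```r
library(data.table)

dt <- data.table(saleprice = c(NA, 0, 10, 550000, 1250000))

# Drop NA, zero, and implausibly small prices (family transfers etc.).
dt_clean <- dt[!is.na(saleprice) & saleprice > 10000]
nrow(dt_clean)  # 2
```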

yearbuilt

building class

Also, we know that taxclass_present and taxclass_past are highly correlated. taxclass_present has about 500 NA records; I believe these are demolished buildings. So, I decided to eliminate taxclass_past completely and drop only the NA values of taxclass_present.
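Sketched with data.table (column names as in this notebook; the toy rows are invented):

```r
library(data.table)

dt <- data.table(
  taxclass_present = c("1", "2", NA, "4"),
  taxclass_past    = c("1", "2", "2", "4")
)

dt[, taxclass_past := NULL]        # drop the redundant, correlated column
dt <- dt[!is.na(taxclass_present)] # drop the NA rows (3 rows remain here)
```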

Missing Values

Other Features

I'll try to extract some info from features listed below.

Focus on features

building category

Building class and building category seem to be related, though there is no 1-1 mapping between them. Yet, for instance, b_class = a is always category 1.

In the beginning I thought I might impute building category by using b_class_present. However, most missing values of category belong to b_class c, d, and r, and these classes do not have enough observations with complete category info.

Apart from that, only b_class_cat 1 and 2 have a meaningful presence in the data set, and b_class_present can cover this as well.
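The class-to-category relationship can be inspected with a contingency table; a toy sketch (values invented to mirror the observations above):

```r
b_class     <- c("a", "a", "a", "b", "c", "c", "d")
b_class_cat <- c("1", "1", "1", "2", NA, NA, NA)

# b_class "a" always falls in category 1, while classes c and d have
# mostly missing categories here, so imputation from them is not reliable.
tab <- table(b_class, b_class_cat, useNA = "ifany")
```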

So, I'll ignore building category in the calculations in the further steps.

Nevertheless, I just want to explore building category more.

So, luxury hotels have the highest mean, but there are only 8 records. The light-blue cell is at the top right of the treemap.

It really seems that we cannot benefit from building category for price prediction.

Grouping building classes

Address

Maybe we can extract some info from the address column. Let's see.

The new address info can be useful in predicting small-house prices.
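One way to extract that info is to split address into a street part and a unit part; a sketch assuming a comma-separated "STREET, APT" convention (the sample addresses are hypothetical):

```r
address <- c("153 AVENUE B, 2A", "234 EAST 4TH STREET", "197 EAST 3RD STREET, 1B")

parts  <- strsplit(address, ",\\s*")
street <- vapply(parts, function(p) p[1], character(1))
unit   <- vapply(parts, function(p) if (length(p) > 1) p[2] else NA_character_,
                 character(1))

# A unit suffix hints at an individual apartment rather than a whole building.
has_unit <- !is.na(unit)
```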

Neighborhood

The size of the rectangles reflects the number of data points, while the color shows the mean sale price in the neighborhood.

Distributions

midtowncbd includes the extreme record with a saleprice of 210000000000.

t(dt[saleprice > 2000000000])

WRT SalePrice

The saleprice quartiles of each borough seem different. Especially look at boroughs 1 and 2.

There is no obvious finding regarding yearbuilt, but it may be connected with other features.

Filling gross square feet

Several Transformations

Grouping buildings

Outlier in SalePrice

The nicest shape belongs to saleprice_log_wo :)
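For reference, a sketch of how a column like saleprice_log_wo ("log, without outliers") can be produced; the 1.5×IQR rule below is an assumed outlier definition, not necessarily the one used in this notebook:

```r
saleprice <- c(1e4, 2e5, 5e5, 8e5, 1e6, 2.1e11)  # toy values incl. one extreme record
saleprice_log <- log(saleprice)

# Trim both tails with a 1.5*IQR fence on the log scale.
q    <- quantile(saleprice_log, c(0.25, 0.75))
iqr  <- q[2] - q[1]
keep <- saleprice_log >= q[1] - 1.5 * iqr & saleprice_log <= q[2] + 1.5 * iqr

saleprice_log_wo <- saleprice_log[keep]  # the extreme record and the tiny one drop out
```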

Any pattern in a year?

Any pattern explicable with week effect?

Check the reliability of the data set

totalunits = 0 and NA square-feet columns are strongly related.

Check (residentialunits + commercialunits != totalunits)

Let's check whether the sum of residentialunits and commercialunits gives us the total units. It seems that the inconsistency is present mostly in tax class 4 (which has the lowest tax rate: https://www1.nyc.gov/site/finance/taxes/property-bills-and-payments.page). Moreover, most of these records have an R-type building class, which covers things like parking and storage spaces.
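A minimal sketch of this consistency check in data.table (toy rows; column names as in this notebook):

```r
library(data.table)

dt <- data.table(
  residentialunits = c(2, 0, 10),
  commercialunits  = c(1, 1, 0),
  totalunits       = c(3, 0, 10)
)

# Rows where the unit counts do not add up (here: the middle row).
inconsistent <- dt[residentialunits + commercialunits != totalunits]
```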

Zero-zipcodes

NA area

Most of the NA landsquarefeet and grosssquarefeet values occur when commercial units are equal to 0.

Additional Columns

Extra info

Encoding

logs

Modeling

Lasso

Log + outlier

KNN

Since a problem occurred with the caret package, I am unfortunately not able to share the results.

SVM

The first chunk takes about 6 minutes to run.

XGBoost (Base)

A general model with everything vs different models for borough and building type.

If we remove zipcode and keep grosssquarefeet_log_group_encoded, rmse increases only a little.
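A hedged sketch of the base setup (synthetic features and target; names like grosssquarefeet_log follow this notebook, but the hyperparameters are illustrative, not the tuned values):

```r
library(xgboost)

set.seed(1)
n <- 200
X <- cbind(
  grosssquarefeet_log = rnorm(n, mean = 8),
  borough             = sample(1:5, n, replace = TRUE),
  zipcode_encoded     = rnorm(n)
)
# Synthetic stand-in for the log-price target.
saleprice_log <- 2 * X[, "grosssquarefeet_log"] + rnorm(n, sd = 0.1)

model <- xgboost(
  data = X, label = saleprice_log,
  nrounds = 50, objective = "reg:squarederror", verbose = 0
)

rmse <- sqrt(mean((predict(model, X) - saleprice_log)^2))
```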

Based on borough

Predictions are generally below actual sale prices. It seems that outliers in the training set cause problems.

But before that, let's try to address each borough and building-class pair separately.

XGBoost Based on Borough & Building Class

If we build partial models, it is important that we eliminate the outliers, because with target = saleprice, rmse increases drastically.

XGBoost Based on Borough & Building Class W/O Outliers in Training

saleprice with zipcode

saleprice_log with zipcode

saleprice_log without zipcode

Deep dive into 'others'

XGBoost with TimeSeries Perspective

Final words

The best model seems to be saleprice_log with zipcode, trained without outliers.

zipcode improves the model performance.

Outliers in the prediction set cause some sort of tradeoff:

  1. Improve the test performance: predict outlier observations better at the expense of predicting the regular data points worse.

  2. Improve the prediction accuracy of as many data points as possible: diverge from the outlier observations and predict regular data points better.

There should be some way to distinguish outlier points from regular data points. However, I could not find one, so I could not predict them with separate models. That's why I could not escape the tradeoff mentioned above.

Special: Why Bad Predictions?

All are located close to an avenue.

All have both residential and commercial units.

I could not find a common property of outlier points :(

SQL